Course: Applied Geo-Data Science at the University of Bern (Institute of Geography)
Supervisor: Prof. Dr. Benjamin Stocker
Adviser: Dr. Koen Hufkens, Pepa Aran, Pascal Schneider
Further information: https://geco-bern.github.io/agds/
Do you have questions about the workflow? Contact the Author:
Author: Bigler Patrick (patrick.bigler1@students.unibe.ch)
Matriculation number: 20-100-178
Reference: Report Exercise 5 (Chapter 8)
Modeling is an important aspect of data science, but models come in different types and levels of complexity. Using a realistic problem, the following skills are practiced:
Understand the basics of regression models and their classification.
Fit linear and logistic regression models in R.
Choose and calculate relevant model performance metrics.
Evaluate and compare regression models.
Detect data outliers.
Select best predictive variables.
In environmental and climate science, dozens of variables are typically measured. We want to use a model to predict a target; here, we use a linear model. Unfortunately, we do not know in advance how many variables, and which ones, we should use. We only know that we are looking for a formula of the form:
\[ \hat{y}=\hat{a}_0+\sum_{n=1}^{N}\hat{a}_n \cdot x_n \]
In this formula, \(\hat{a}_0\) is the estimated intercept, \(\hat{a}_n\) is the n-th estimated coefficient, and \(x_n\) is the n-th variable. With this formula, we want to estimate the target \(\hat{y}\) as accurately as possible. For that, we calculate \(R^2\), a metric that tells us how much of the total variance is explained by the model. If we add an additional variable to the model, then \(R^2\) always increases, for purely mathematical reasons.
Unfortunately, this does not necessarily mean that the model becomes better. We therefore need another quality measure, which is why we use the Akaike Information Criterion (AIC). The model with the minimum AIC is often the best model for generalizing to new data. The AIC is based on the log-likelihood plus a correction term that accounts for the number of variables. At first, the AIC becomes smaller as variables are added. Once the influence of the correction term outweighs the gain in likelihood, the AIC stops decreasing, and we have a basis for the model choice.
In summary, \(R^2\) increases with each additional variable. To account for the complexity of the model, we also monitor the AIC. Once it stops decreasing, we have probably found our best multivariate linear model.
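As a small illustration (synthetic data, not the FLUXNET data used below), the following sketch shows both effects: \(R^2\) can only increase when a predictor is added, while the AIC penalizes the extra parameter.

```r
# Toy illustration (synthetic data): adding a pure-noise predictor can only
# raise R^2, while the AIC penalizes the extra parameter.
set.seed(42)
n     <- 200
x     <- rnorm(n)
noise <- rnorm(n)                       # unrelated to the target
y     <- 2 + 3 * x + rnorm(n)

fit.1 <- lm(y ~ x)                      # meaningful predictor only
fit.2 <- lm(y ~ x + noise)              # plus an uninformative predictor

summary(fit.1)$r.squared <= summary(fit.2)$r.squared  # always TRUE
AIC(fit.1); AIC(fit.2)                  # AIC of fit.2 is typically higher

# The AIC of a Gaussian linear model can be recomputed by hand:
# AIC = -2 * log-likelihood + 2 * (number of parameters, incl. error variance).
# fit.1 has 2 coefficients plus the error variance, i.e. 3 parameters.
rss <- sum(residuals(fit.1)^2)
n * (log(2 * pi) + log(rss / n) + 1) + 2 * 3   # matches AIC(fit.1)
```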
To build the best linear regression model, we could of course simply try all possible combinations of predictors, but that would be a waste of resources. We cannot avoid a trial-and-error approach entirely, but we would like to reduce the number of combinations. We therefore implement an algorithm that works as follows:
1. Let \(N\) be the number of candidate variables, \(N\in[1,\:number\:of\:variables]\). The candidates may be all the variables we have or only a subset of them; the algorithm itself decides how many end up in the model (at the start, we do not know this). Let \(p\in[1,N]\) denote a candidate predictor.
2. The algorithm fits a bivariate linear model for each candidate \(p\). The \(p\) with the highest RSQ is used as a predictor and deleted from the candidate list, because each variable can be used only once.
3. The algorithm fits a model with \(p+1\) predictors for each remaining candidate. Again, the candidate with the highest RSQ is kept, and the AIC of the selected model is calculated.
4. Step 3 is iterated until the AIC of the model no longer decreases.
5. We have found the (presumably) optimal model.
For the first task, we apply only the first two steps because we want only a bivariate model. For the second task, we iterate step 3 until the AIC no longer decreases.
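The steps above can be sketched in a few lines of base R. The function name forward.select and its interface are hypothetical and do not correspond to the outsourced functions used in this report; note also that missing values would have to be handled consistently (e.g. with na.omit()) before applying it to the FLUXNET data.

```r
# Minimal sketch of the step-forward algorithm (hypothetical helper,
# not the outsourced function used in this report)
forward.select <- function(df, target) {
  candidates <- setdiff(names(df), target)  # the target must never be a predictor
  chosen     <- character(0)
  best.aic   <- Inf
  repeat {
    if (length(candidates) == 0) break
    # Fit one model per remaining candidate and keep the highest R^2
    rsq <- sapply(candidates, function(p) {
      f <- reformulate(c(chosen, p), response = target)
      summary(lm(f, data = df))$r.squared
    })
    winner  <- names(which.max(rsq))
    new.fit <- lm(reformulate(c(chosen, winner), response = target), data = df)
    # Stop as soon as the AIC no longer decreases
    if (AIC(new.fit) >= best.aic) break
    best.aic   <- AIC(new.fit)
    chosen     <- c(chosen, winner)
    candidates <- setdiff(candidates, winner)
  }
  chosen
}

# Synthetic demonstration: x1 and x2 drive y, junk does not
set.seed(1)
demo   <- data.frame(x1 = rnorm(100), x2 = rnorm(100), junk = rnorm(100))
demo$y <- 2 * demo$x1 + demo$x2 + rnorm(100, sd = 0.5)
forward.select(demo, "y")   # x1 is selected first, then x2
```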
The open-source program RStudio (version 2022.12.0+353) was used for all analyses. All data handling is done in the R language. We use the R package “tidyverse” and other packages to process the data efficiently.
The following code chunk loads all packages we need. Particularly important is the package “conflicted”: a function in one package can have the same name as a function in another package while doing something different, and “conflicted” lets us set explicit preferences in such cases.
# Load the file
source("../../R/general/packages.R")
First, we download the data and save it in our repository. But we must do this only once: if the file already exists, we can read it directly. That is why we implemented an if-else statement.
# We want to know if a certain file already exists
name.of.file <- "../../data/re_stepwise/FLX_CH_stepwise.csv"
# If no such file exists, create it!
if (!file.exists(name.of.file)){
  # Access to the data
  url <- "https://raw.githubusercontent.com/geco-bern/agds/main/data/df_for_stepwise_regression.csv"
  # Read the data directly from the URL
  database.S1 <- read.table(url, header = TRUE, sep = ",")
  # Write a CSV file into the repository
  write_csv(database.S1, "../../data/re_stepwise/FLX_CH_stepwise.csv")
  # Read the file
  database <- read_csv("../../data/re_stepwise/FLX_CH_stepwise.csv")
# If such a file already exists, just read it!
}else{database <- read_csv("../../data/re_stepwise/FLX_CH_stepwise.csv")}
To make a good decision about the predictors, we need as much information as possible. For that, we read our file as a data frame and get an overview of the data: Which variables does the data set contain? What are the main statistical parameters? How many values are missing? We can see that long-wave incoming radiation has a lot of missing values. This could be a problem: our model will probably not choose this variable because its \(R^2\) will be low. If we had complete data, the result might be different.
# Check whether there are any missing values (we show only the first six rows)
head(apply(database, 2, function(x) is.na(x)))
## siteid TIMESTAMP TA_F SW_IN_F LW_IN_F VPD_F PA_F P_F WS_F TA_F_MDS
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## SW_IN_F_MDS LW_IN_F_MDS VPD_F_MDS CO2_F_MDS PPFD_IN GPP_NT_VUT_REF USTAR
## [1,] FALSE TRUE FALSE FALSE TRUE TRUE FALSE
## [2,] FALSE TRUE FALSE FALSE TRUE TRUE FALSE
## [3,] FALSE TRUE FALSE FALSE TRUE FALSE FALSE
## [4,] FALSE TRUE FALSE FALSE TRUE FALSE FALSE
## [5,] FALSE TRUE FALSE FALSE TRUE FALSE FALSE
## [6,] FALSE TRUE FALSE FALSE TRUE FALSE FALSE
# Load the function
source("../../R/general/function.visualize.na.values.R")
# Function call to visualize the NA
visualize.na.values.without.groups(database, c("Problem with long-wave radiation!"))
# Get an overview and the main statistical parameters
summary(database)
## siteid TIMESTAMP TA_F SW_IN_F
## Length:23011 Min. :1996-01-01 Min. :-29.184 Min. : 0.156
## Class :character 1st Qu.:2002-12-01 1st Qu.: 0.431 1st Qu.: 52.933
## Mode :character Median :2007-02-15 Median : 7.286 Median :120.679
## Mean :2006-10-25 Mean : 6.950 Mean :137.633
## 3rd Qu.:2011-01-23 3rd Qu.: 13.313 3rd Qu.:215.243
## Max. :2014-12-31 Max. : 31.166 Max. :398.192
##
## LW_IN_F VPD_F PA_F P_F
## Min. :138.1 Min. : 0.000 Min. : 80.37 Min. : 0.000
## 1st Qu.:265.0 1st Qu.: 0.831 1st Qu.: 84.31 1st Qu.: 0.000
## Median :300.1 Median : 2.542 Median : 97.38 Median : 0.010
## Mean :295.8 Mean : 3.787 Mean : 93.47 Mean : 2.320
## 3rd Qu.:328.9 3rd Qu.: 5.239 3rd Qu.: 98.77 3rd Qu.: 1.749
## Max. :420.1 Max. :33.290 Max. :103.28 Max. :186.400
##
## WS_F TA_F_MDS SW_IN_F_MDS LW_IN_F_MDS
## Min. :0.106 Min. :-29.184 Min. : 0.156 Min. :138.1
## 1st Qu.:1.805 1st Qu.: 0.446 1st Qu.: 53.317 1st Qu.:269.5
## Median :2.387 Median : 7.356 Median :120.031 Median :303.3
## Mean :2.670 Mean : 6.987 Mean :137.482 Mean :299.4
## 3rd Qu.:3.286 3rd Qu.: 13.301 3rd Qu.:215.580 3rd Qu.:332.9
## Max. :9.928 Max. : 31.166 Max. :398.192 Max. :420.1
## NA's :325 NA's :341 NA's :11057
## VPD_F_MDS CO2_F_MDS PPFD_IN GPP_NT_VUT_REF
## Min. : 0.000 Min. :297.3 Min. : -0.249 Min. :-6.683
## 1st Qu.: 0.852 1st Qu.:372.1 1st Qu.: 91.040 1st Qu.: 0.802
## Median : 2.612 Median :383.7 Median :231.956 Median : 2.845
## Mean : 3.850 Mean :385.1 Mean :271.714 Mean : 3.503
## 3rd Qu.: 5.326 3rd Qu.:395.6 3rd Qu.:430.028 3rd Qu.: 5.600
## Max. :33.290 Max. :724.2 Max. :995.106 Max. :18.511
## NA's :768 NA's :290 NA's :4574 NA's :2426
## USTAR
## Min. :0.0749
## 1st Qu.:0.2850
## Median :0.3856
## Mean :0.4255
## 3rd Qu.:0.5244
## Max. :1.5941
## NA's :2468
Now we want to choose the best variable as our predictor. For that, we try all variables and choose the one with the highest RSQ (see introduction). We can see that the chosen variable does not have the lowest AIC. However, because of the missing values, the bivariate models are fitted on different numbers of observations, so their AIC values are not directly comparable; we use RSQ instead.
# Access the outsourced function for the bivariate model
source("../../R/re_stepwise/function.bivariate.model.R")
# Function call. 16 is the column number of the target
tibble.bi.model <- model.fitter(database, 16)
knitr::kable(tibble.bi.model, align = c("l","l", rep("c", 2)), format = "html",
caption = paste('<b> Table 1: Overview of the bivariate models</b>', ""))|>
kable_classic()
| Target | Predictor | RSQ | AIC | Fit |
|---|---|---|---|---|
| GPP_NT_VUT_REF | siteid | 0.0458181 | 46735.32 | NO |
| GPP_NT_VUT_REF | TIMESTAMP | 0.0047922 | 47597.89 | NO |
| GPP_NT_VUT_REF | TA_F | 0.3871016 | 37619.26 | NO |
| GPP_NT_VUT_REF | SW_IN_F | 0.4306405 | 36102.41 | NO |
| GPP_NT_VUT_REF | LW_IN_F | 0.1676259 | 43919.98 | NO |
| GPP_NT_VUT_REF | VPD_F | 0.1908666 | 43337.05 | NO |
| GPP_NT_VUT_REF | PA_F | 0.0002403 | 47691.83 | NO |
| GPP_NT_VUT_REF | P_F | 0.0020894 | 47653.72 | NO |
| GPP_NT_VUT_REF | WS_F | 0.0082500 | 47526.24 | NO |
| GPP_NT_VUT_REF | TA_F_MDS | 0.3871137 | 37559.45 | NO |
| GPP_NT_VUT_REF | SW_IN_F_MDS | 0.4300828 | 36065.75 | NO |
| GPP_NT_VUT_REF | LW_IN_F_MDS | 0.1209379 | 25130.95 | NO |
| GPP_NT_VUT_REF | VPD_F_MDS | 0.1811937 | 42567.59 | NO |
| GPP_NT_VUT_REF | CO2_F_MDS | 0.0295658 | 47078.98 | NO |
| GPP_NT_VUT_REF | PPFD_IN | 0.4520702 | 29689.54 | BEST FIT |
| GPP_NT_VUT_REF | GPP_NT_VUT_REF | NA | NA | NA |
| GPP_NT_VUT_REF | USTAR | 0.0000316 | 45048.57 | NO |
Scatter-plots are meaningful for bivariate linear models and give us a first impression. But we want to know more about the coefficients, the intercept, the distribution of the residuals, and the outliers. For that, we use summary() and plot the model (see discussion).
# Calculate the probably best bivariate model
best.bi.lm <- lm(GPP_NT_VUT_REF~PPFD_IN, data = database)
# Load the function
source("../../R/re_stepwise/function.vis.bi.model.R")
# Plot the bivariate model
vis.bi.model(database)
# Overview of all important model parameters
# Model call. We are also interested in the p-values
summary(best.bi.lm)
##
## Call:
## lm(formula = GPP_NT_VUT_REF ~ PPFD_IN, data = database)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0475 -1.3255 -0.3758 1.2780 12.3295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.72533 0.03057 23.73 <2e-16 ***
## PPFD_IN 0.01062 0.00009 117.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.41 on 16872 degrees of freedom
## (6137 observations deleted due to missingness)
## Multiple R-squared: 0.4521, Adjusted R-squared: 0.452
## F-statistic: 1.392e+04 on 1 and 16872 DF, p-value: < 2.2e-16
# Plot the model
plot(best.bi.lm)
What is happening with RSQ? Here we visualize the RSQ for each variable. We can easily see that PPFD_IN has the highest RSQ, which is why we chose it as our predictor. But what if we want a multivariate regression? One might think that we should simply add the variable with the second-highest RSQ. But we will see that this is generally not the case.
# Load the function
source("../../R/re_stepwise/function.vis.and.dev.bi.R")
# Function call
Vis_of_the_develop.bi.model(tibble.bi.model)
Here, we want to answer the question of which predictors we should choose to obtain the probably best linear regression model. Should we choose all of them or only a subset? For that, we have written a function that works as described in the introduction. We also need to consider which variables make sense at all; this depends on the target and the context. Therefore, we programmed the function so that it can be used flexibly: the variables we do not need are simply passed as column numbers to be dropped.

We made three function calls to demonstrate this. For the first, we use all variables except the target. This is, however, not necessarily the best approach: we could argue that the time factor and the location are irrelevant and drop these variables, too. We can see that the difference is substantial. The output is again an HTML table in which every row is a regression model containing all the variables above it. We can see that \(R^2\) increases slightly with every additional variable, while the AIC first decreases and then increases again. This is exactly what we expected.
# Access the outsourced function of the multivariate model
source("../../R/re_stepwise/function.multivariate.model.R")
# Function call with all columns except the target (column number 16)
multi.model.2 <- multivariate.model(database, c(16))
## [1] "Best model fit has AIC = 21762.1756977506"
## [1] "PPFD_IN" "LW_IN_F" "siteid" "VPD_F" "TA_F_MDS" "SW_IN_F"
## [7] "WS_F" "CO2_F_MDS" "P_F" "PA_F"
# Function call without the long-wave variable
multi.model.3 <- multivariate.model(database, c(5,16))
## [1] "Best model fit has AIC = 21890.3981778365"
## [1] "PPFD_IN" "siteid" "TA_F_MDS" "VPD_F" "SW_IN_F" "WS_F"
## [7] "CO2_F_MDS" "P_F"
# Function call without the variables siteid, TIMESTAMP and GPP_NT_VUT_REF
multi.model.4 <- multivariate.model(database, c(1,2,16))
## [1] "Best model fit has AIC = 24469.2854126171"
## [1] "PPFD_IN" "LW_IN_F" "VPD_F" "TA_F_MDS" "SW_IN_F" "P_F"
## [7] "WS_F" "CO2_F_MDS" "PA_F"
# Overview about the models
knitr::kable(multi.model.2, align = c("l","c", "c"), format = "html",
caption = paste('<b>Table 2: Overview of the multivariate model:</b>', "Each row is a model and contains all variables above. The last row has a higher AIC than the row above, and we stopped the calculation there. The final model contains all variables from the first row to the second-to-last row."))|>
kable_classic()
| Predictors | RSQ | AIC |
|---|---|---|
| PPFD_IN | 0.4520702 | 29689.54 |
| LW_IN_F | 0.5301754 | 27096.52 |
| siteid | 0.5981762 | 24464.35 |
| VPD_F | 0.6160337 | 23699.27 |
| TA_F_MDS | 0.6454925 | 22354.30 |
| SW_IN_F | 0.6513971 | 22072.88 |
| WS_F | 0.6564185 | 21830.06 |
| CO2_F_MDS | 0.6571345 | 21796.85 |
| P_F | 0.6577174 | 21770.14 |
| PA_F | 0.6579195 | 21762.18 |
| TIMESTAMP | 0.6579273 | 21763.80 |
knitr::kable(multi.model.3, align = c("l","c", "c"), format = "html",
caption = paste('<b>Table 3: Overview of the multivariate model (input without long-wave radiation):</b>', "Each row is a model and contains all variables above. The last row has a higher AIC than the row above, and we stopped the calculation there. The final model contains all variables from the first row to the second-to-last row."))|>
kable_classic()
| Predictors | RSQ | AIC |
|---|---|---|
| PPFD_IN | 0.4520702 | 29689.54 |
| siteid | 0.5162893 | 27592.02 |
| TA_F_MDS | 0.6011577 | 24338.67 |
| VPD_F | 0.6445535 | 22396.93 |
| SW_IN_F | 0.6497964 | 22148.19 |
| WS_F | 0.6546706 | 21913.68 |
| CO2_F_MDS | 0.6550163 | 21898.78 |
| P_F | 0.6552285 | 21890.40 |
| TIMESTAMP | 0.6552550 | 21891.10 |
knitr::kable(multi.model.4, align = c("l","c", "c"), format = "html",
caption = paste('<b> Table 4: Overview of the multivariate model (input without siteid and timestamp):</b>', "Each row is a model and contains all variables above. The last row has a higher AIC than the row above, and we stopped the calculation there. The final model contains all variables from the first row to the second-to-last row."))|>
kable_classic()
| Predictors | RSQ | AIC |
|---|---|---|
| PPFD_IN | 0.4520702 | 29689.54 |
| LW_IN_F | 0.5301754 | 27096.52 |
| VPD_F | 0.5746412 | 25420.80 |
| TA_F_MDS | 0.5881248 | 24879.25 |
| SW_IN_F | 0.5918328 | 24728.65 |
| P_F | 0.5945354 | 24618.55 |
| WS_F | 0.5963983 | 24542.84 |
| CO2_F_MDS | 0.5981106 | 24473.10 |
| PA_F | 0.5982491 | 24469.29 |
| TA_F | 0.5982635 | 24470.68 |
We visualize the development of the model with a plot that has two different y-axes. We can see that the model with more variables has a higher RSQ. That is no surprise, because RSQ always increases when we add a variable. More important is the AIC: it decreases with each added variable until it increases again. That is the point at which we stopped our algorithm.
We calculated three models: one with all variables except the target (table 2), one additionally without long-wave radiation (table 3), and one without the target, siteid, and TIMESTAMP (table 4). Comparing the three, the probably best model has the lowest AIC (21762.2 vs. 21890.4 vs. 24469.3) and the highest RSQ (0.6579 vs. 0.6552 vs. 0.5983). Model 2 is therefore better than models 3 and 4.
# Load the function
source("../../R/re_stepwise/function.vis.and.dev.multi.R")
# Function call
vis.and.dev.multi.model(multi.model.2)
vis.and.dev.multi.model(multi.model.3, c("[without incoming long-wave]"))
vis.and.dev.multi.model(multi.model.4, c("[without siteid and TIMESTAMP]"))
There are no meaningful scatter-plots for multidimensional models. But we want to know more about the coefficients, the intercept, the distribution of the residuals, and the outliers. We make some calls using summary(). We get the same values for the RSQ, which confirms that our algorithm works. We see that not all coefficients are significant. The code chunk is set to eval=false; if you are interested, set eval=true and run it.
# Calculate the probably best multivariate model
best.multi.lm <- lm(GPP_NT_VUT_REF~PPFD_IN + LW_IN_F + VPD_F + TA_F_MDS +
SW_IN_F + P_F + WS_F + CO2_F_MDS + PA_F + siteid + TIMESTAMP,
data = database)
second.multi.lm <- lm(GPP_NT_VUT_REF~PPFD_IN + VPD_F + TA_F_MDS +
SW_IN_F + P_F + WS_F + CO2_F_MDS + PA_F + siteid + TIMESTAMP,
data = database)
third.multi.lm <- lm(GPP_NT_VUT_REF~PPFD_IN + LW_IN_F + VPD_F + TA_F_MDS +
SW_IN_F + P_F + WS_F + CO2_F_MDS + PA_F,
data = database)
# Plot the model to get an overview of the residuals
plot(best.multi.lm)
If you are interested in a summary, you can run this code chunk manually (set eval=true and it will work).
# Overview of all important model parameters
summary(best.multi.lm)
summary(second.multi.lm)
summary(third.multi.lm)
It seems that the step-forward algorithm has been successfully implemented. However, it is important to choose the predictors carefully because the final model can vary. The target, of course, must be removed before we run the algorithm; if we use the target as a predictor, RSQ will always be 1. Additionally, we must be aware of linear combinations among the predictors, which could distort our results. We must always know what each variable stands for, and then decide whether it is a meaningful predictor. For example, a time factor could be very important if the experiment took place outdoors and climate factors mattered. If the experimental setting was indoors under constant laboratory conditions, date and time are irrelevant. It is not always clear what a “good choice” is.
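The concern about linear combinations can be checked empirically. As a sketch (the helper find.high.cor and the 0.95 threshold are assumptions, not part of the report's code), one can flag pairs of numeric variables that are nearly collinear:

```r
# Sketch: flag pairs of numeric variables with very strong pairwise
# correlation, a warning sign for near-linear combinations
# (the 0.95 threshold is an arbitrary assumption).
find.high.cor <- function(df, threshold = 0.95) {
  cc  <- cor(Filter(is.numeric, df), use = "pairwise.complete.obs")
  idx <- which(abs(cc) > threshold & upper.tri(cc), arr.ind = TRUE)
  data.frame(var1 = rownames(cc)[idx[, 1]],
             var2 = colnames(cc)[idx[, 2]],
             r    = cc[idx])
}

# Small synthetic demonstration: x and almost.x are nearly identical
set.seed(2)
demo          <- data.frame(x = rnorm(50))
demo$almost.x <- demo$x + rnorm(50, sd = 0.01)
demo$other    <- rnorm(50)
find.high.cor(demo)   # reports only the pair (x, almost.x)
```

On the report's data, one would expect, for example, the `_F` and `_F_MDS` variants of the same quantity to show up as near-duplicates.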
For task 1, we could have written a function very similar to the one for task 2. Instead, we wrote a function that is a little more complicated than necessary in order to demonstrate how the algorithm works. The algorithm calculates a linear regression model for each predictor. The table shows that the model with the highest RSQ is not necessarily the model with the lowest AIC.
We also wrote a function for task two. It is a step-forward algorithm. First, the function calculates a bivariate linear regression model for each candidate and chooses the one with the highest RSQ. After that, a multivariate regression model is calculated, and again the one with the highest RSQ is chosen. In each round, we compare the AIC. If the AIC is higher than that of the previous model, we stop the calculation, and we have probably found our best-fit model. But here we have the same problem as described above: our result depends on our input. Therefore, we need to consider which variables we include in the calculation.
For demonstration, we made several function calls. We can easily see that the results differ. That means we must (1) choose wisely (perhaps with an educated guess), (2) try different calls to estimate how big the differences are, and (3) document what we decided and why.
Inspired by: https://www.statology.org and the book “Statistik” by L. Fahrmeir.
Residuals vs. fitted plot
We can see that the red line almost follows the dashed line: it is near zero and approximately horizontal. That means linearity is not violated, and the assumption behind using a linear regression model was reasonable. However, for the lowest and highest fitted values, linearity is slightly violated (especially for high values).
Q-Q-Plot
The Q-Q plot shows that the residuals almost follow a normal distribution. The error is probably mainly statistical and not systematic. But for the highest and lowest values, the deviation becomes larger, which suggests a systematic error for extreme values (especially high ones). The plot also supports the choice of a linear model, because the points almost follow the 45° line (on which the residuals would be perfectly normally distributed).
Scale-Location Plot
The red line should be roughly horizontal across the plot. If so, the assumption of homoscedasticity is probably justified. The scattering should be random, with no pattern at all. Both are approximately true: the spread of the residuals is roughly equal at all fitted values. Only for the lowest values can we see a bit of a pattern.
Leverage-Plot
There are no values beyond Cook's distance. That means there are no influential outliers that distort our analysis.
The plots are approximately the same as for the bivariate model and need no further discussion. Only the scale-location plot deserves a few words: the red line is not horizontal, so there could be a problem with homoscedasticity. It is not dramatic and does not require an intervention, but we should keep it in mind.
We can see that RSQ increases quickly for the first few steps. Then the improvement slows down until there is no further gain, and we can stop extending our model. The AIC shows similar behavior, but it decreases instead of increasing: first quickly, then more slowly. In the last step, the AIC increased slightly; that is when we stop our calculation.
The RSQ is based on the variance decomposition theorem, which states that the variance of the dependent variable can be written as the sum of the variance explained by the regression model and the variance of the residuals (unexplained variance). By definition, RSQ increases with additional variables, but the growth flattens out. Strictly speaking, we should discuss the adjusted RSQ instead of the RSQ; the RSQ is always a bit higher than the adjusted RSQ. Within the algorithm, however, it does not matter which one we use, because the algorithm stops when the AIC increases. When we report a final model, we should use the adjusted RSQ.
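For completeness, here is a short sketch (synthetic data) of the relation \( \bar{R}^2 = 1-(1-R^2)\frac{n-1}{n-p-1} \), showing that the plain RSQ is indeed always slightly higher than the adjusted version:

```r
# Sketch (synthetic data): multiple R^2 versus adjusted R^2.
# The adjusted version subtracts a penalty for the p predictors:
#   adj.R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
set.seed(7)
n  <- 100
df <- data.frame(x = rnorm(n), noise = rnorm(n))
df$y <- df$x + rnorm(n)

fit    <- lm(y ~ x + noise, data = df)
r2     <- summary(fit)$r.squared
adj.r2 <- summary(fit)$adj.r.squared

# Recompute the adjustment by hand (p = 2 predictors)
manual <- 1 - (1 - r2) * (n - 1) / (n - 2 - 1)
all.equal(adj.r2, manual)   # TRUE
r2 > adj.r2                 # TRUE: the plain R^2 is always a bit higher
```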
It is also important to note that RSQ values do not simply add up but must be recalculated at every step. This means that for a multivariate model, we cannot simply take the variables with the highest RSQ values from the individual bivariate models, or those with the highest correlation coefficients. This is visible in the figures “visualization of the development of RSQ and AIC (bivariate)”. That is why we use the AIC for the model choice, and that is also why our step-forward algorithm recalculates the RSQ in each round.
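A small synthetic example illustrates why the bivariate RSQ values cannot simply be reused: when two predictors are correlated, their shared explained variance is counted only once in the joint model.

```r
# Sketch (synthetic data): with correlated predictors, the multivariate
# R^2 is NOT the sum of the bivariate R^2 values, so the single-predictor
# ranking cannot be reused and R^2 must be recomputed at every step.
set.seed(3)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.8 * x1 + 0.2 * rnorm(n)   # strongly correlated with x1
y  <- x1 + x2 + rnorm(n)

r2.x1   <- summary(lm(y ~ x1))$r.squared
r2.x2   <- summary(lm(y ~ x2))$r.squared
r2.both <- summary(lm(y ~ x1 + x2))$r.squared

r2.both < r2.x1 + r2.x2   # TRUE: the shared variance is counted only once
```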
We can now consider why a particular variable was included in the model and another was not. For this, we need a close look at the variables: it is important to know which variables we used and why.
For the bivariate model, the highest RSQ was reached by the variable PPFD_IN (incoming photosynthetic photon flux density). This makes sense since the target was GPP_NT_VUT_REF (gross primary productivity), and GPP depends, among other things, on the photosynthetic photon flux density. The second- and third-best models used short-wave radiation variables. This also makes sense because photosynthesis depends on short-wave radiation: energy is transported in photons, and photons of short-wave radiation carry more energy than those of long-wave radiation. It is therefore not the intensity of the light that is crucial for photosynthesis but the photon energy, which is a function of the frequency. Long-wave radiation carries little photon energy and contributes almost nothing to photosynthesis, which is why there is little correlation between GPP and the long-wave variables. TA is the air temperature and VPD is the vapor pressure deficit; they contribute somewhat to the GPP, but they are not as important as the SW or PPFD variables.
We see the same in our multivariate models. In the first multivariate model, we included all variables. The first variable chosen is again PPFD_IN (which confirms the correctness of the algorithm, because it matches the bivariate result). It may be surprising that the long-wave variable was chosen second by the algorithm; it seems to be important after all. But this is deceptive. In the second model, we excluded the long-wave variable, and the metrics changed only slightly (RSQ from 0.6579 to 0.6552 and AIC from 21762 to 21890). That means the variable is not important for the modeling, and our argument is probably right.
Which variables we include is nevertheless very important, as our third model shows. We excluded the time and location variables, and RSQ and AIC differ substantially from the first two models (RSQ drops from about 0.66 to about 0.60, and AIC rises from about 21762 to 24469). That confirms our considerations here as well.